Abstract

Expectation propagation (EP) is a deterministic approximation algorithm that is 
often used to perform approximate Bayesian parameter learning. EP approximates 
the full intractable posterior distribution through a set of local approximations that 
are iteratively refined for each datapoint. EP can offer analytic and computational 
advantages over other approximations, such as Variational Inference (VI), and is 
the method of choice for a number of models. The local nature of EP appears to 
make it an ideal candidate for performing Bayesian learning on large models in 
large-scale dataset settings. However, EP has a crucial limitation in this context: 
the number of approximating factors needs to increase with the number of datapoints,
N, which often entails a prohibitively large memory overhead. This paper
presents an extension to EP, called stochastic expectation propagation (SEP), that 
maintains a global posterior approximation (like VI) but updates it in a local way 
(like EP). Experiments on a number of canonical learning problems using synthetic
and real-world datasets indicate that SEP performs almost as well as full
EP, but reduces the memory consumption by a factor of N. SEP is therefore ideally
suited to performing approximate Bayesian learning in the large model, large
dataset setting. 


1 Introduction 

Recently a number of methods have been developed for applying Bayesian learning to large datasets.
Examples include sampling approximations [1, 2], distributional approximations including stochastic
variational inference (SVI) [3] and assumed density filtering (ADF) a, and approaches that mix
distributional and sampling approximations 00. One family of approximation methods has garnered
less attention in this regard: Expectation Propagation (EP) EE). EP constructs a posterior
approximation by iterating simple local computations that refine factors which approximate the
posterior contribution from each datapoint. At first sight, it therefore appears well suited to
large-data problems: the locality of computation makes the algorithm simple to parallelise and
distribute, and good practical performance on a range of small-data applications suggests that it
will be accurate nansEii i. However, the elegance of local computation has been bought at the price
of prohibitive memory overhead that grows with the number of datapoints N, since local approximating
factors need to be maintained for every datapoint, and each typically incurs the same memory
overhead as the global approximation. The same pathology exists for the broader class of power EP
(PEP) algorithms [12] that includes variational message passing [13]. In contrast, variational
inference (VI) methods M na utilise global approximations that are refined directly, which prevents
memory overheads from scaling with N.

Is there ever a case for preferring EP (or PEP) to VI methods for large data? We believe that there
certainly is. First, EP can provide significantly more accurate approximations. It is well known
that variational free-energy approaches are biased, often severely so 03, and for particular
models, such as those with non-smooth likelihood functions, the variational free-energy objective
is pathologically ill-suited old El. Second, the fact that EP is truly local (to factors in the
posterior distribution and not just likelihoods) means that it affords different opportunities for
tractable algorithm design, as the updates can be simpler to approximate.

As EP appears to be the method of choice for some applications, researchers have attempted to
push it to scale. One approach is to swallow the large computational burden and simply use large
data structures to store the approximating factors (e.g. TrueSkill [18]). This approach can only
be pushed so far. A second approach is to use ADF, a simple variant of EP that only requires a
global approximation to be maintained in memory [19]. ADF, however, provides poorly calibrated
uncertainty estimates 0, which was one of the main motivating reasons for developing EP in the first
place. A third idea, complementary to the one described here, is to use approximating factors that
have simpler structure (e.g. low rank, ESI). This reduces memory consumption (e.g. for Gaussian
factors from $O(ND^2)$ to $O(ND)$), but does not stop the scaling with N. Another idea uses EP to
carve up the dataset [5, 6] using approximating factors for collections of datapoints. This results in
coarse-grained, rather than local, updates and other methods must be used to compute them. (Indeed,
the spirit of [5, 6] is to extend sampling methods to large datasets, not EP itself.)

Can we have the best of both worlds? That is, accurate global approximations that are derived from 
truly local computation. To address this question we develop an algorithm based upon the standard 
EP and ADF algorithms that maintains a global approximation which is updated in a local way. We 
call this class of algorithms Stochastic Expectation Propagation (SEP) since it updates the global 
approximation with (damped) stochastic estimates on data sub-samples in an analogous way to SVI. 
Indeed, the generalisation of the algorithm to the PEP setting directly relates to SVI. Importantly, 
SEP reduces the memory footprint by a factor of N when compared to EP. We further extend the 
method to control the granularity of the approximation, and to treat models with latent variables
without compromising on accuracy or incurring unnecessary memory demands. Finally, we demonstrate
the scalability and accuracy of the method on a number of real-world and synthetic datasets.

2 Expectation Propagation and Assumed Density Filtering 

We begin by briefly reviewing the EP and ADF algorithms upon which our new method is based. 
Consider for simplicity observing a dataset comprising N i.i.d. samples $\mathcal{D} = \{x_n\}_{n=1}^N$ from a
probabilistic model $p(x|\theta)$ parametrised by an unknown D-dimensional vector $\theta$ that is drawn from
a prior $p_0(\theta)$. Exact Bayesian inference involves computing the (typically intractable) posterior
distribution of the parameters given the data,

$$p(\theta|\mathcal{D}) \propto p_0(\theta) \prod_{n=1}^N p(x_n|\theta) \approx q(\theta) \propto p_0(\theta) \prod_{n=1}^N f_n(\theta). \qquad (1)$$

Here $q(\theta)$ is a simpler tractable approximating distribution that will be refined by EP. The goal of
EP is to refine the approximate factors so that they capture the contribution of each of the
likelihood terms to the posterior, i.e. $f_n(\theta) \approx p(x_n|\theta)$. In this spirit, one approach would be to find
each approximating factor $f_n(\theta)$ by minimising the Kullback-Leibler (KL) divergence between the
posterior and the distribution formed by replacing one of the likelihoods by its corresponding
approximating factor, $\mathrm{KL}[p(\theta|\mathcal{D}) \| p(\theta|\mathcal{D}) f_n(\theta) / p(x_n|\theta)]$. Unfortunately, such an update is still
intractable as it involves computing the full posterior. Instead, EP approximates this procedure by
replacing the exact leave-one-out posterior $p_{-n}(\theta) \propto p(\theta|\mathcal{D}) / p(x_n|\theta)$ on both sides of the KL
by the approximate leave-one-out posterior (called the cavity distribution) $q_{-n}(\theta) \propto q(\theta) / f_n(\theta)$.
Since this couples the updates for the approximating factors, the updates must now be iterated.

In more detail, EP iterates four simple steps. First, the factor selected for update is removed from the
approximation to produce the cavity distribution. Second, the corresponding likelihood is included
to produce the tilted distribution $\tilde{p}_n(\theta) \propto q_{-n}(\theta) p(x_n|\theta)$. Third, EP updates the approximating
factor by minimising $\mathrm{KL}[\tilde{p}_n(\theta) \| q_{-n}(\theta) f_n(\theta)]$. The hope is that the contribution the true likelihood
makes to the posterior is similar to the effect the same likelihood has on the tilted distribution. If the
approximating distribution is in the exponential family, as is often the case, then the KL minimisation
reduces to a moment matching step ED that we denote $f_n(\theta) \leftarrow \mathrm{proj}[\tilde{p}_n(\theta)] / q_{-n}(\theta)$. Finally,
having updated the factor, it is included into the approximating distribution.

We summarise the update procedure for a single factor in Algorithm 1. Critically, the approximation
step of EP involves local computations since one likelihood term is treated at a time. The assumption
is that these local computations, although possibly requiring further approximation, are far simpler
to handle compared to the full posterior $p(\theta|\mathcal{D})$. In practice, EP often performs well when the
updates are parallelised. Moreover, by using approximating factors for groups of datapoints, and
then running additional approximate inference algorithms to perform the EP updates (which could
include nesting EP), EP carves up the data making it suitable for distributed approximate inference.

Algorithm 1 EP
1: choose a factor $f_n$ to refine
2: compute cavity distribution: $q_{-n}(\theta) \propto q(\theta) / f_n(\theta)$
3: compute tilted distribution: $\tilde{p}_n(\theta) \propto p(x_n|\theta) q_{-n}(\theta)$
4: moment matching: $f_n(\theta) \leftarrow \mathrm{proj}[\tilde{p}_n(\theta)] / q_{-n}(\theta)$
5: inclusion: $q(\theta) \leftarrow q_{-n}(\theta) f_n(\theta)$

Algorithm 2 ADF
1: choose a datapoint $x_n \sim \mathcal{D}$
2: compute cavity distribution: $q_{-n}(\theta) = q(\theta)$
3: compute tilted distribution: $\tilde{p}_n(\theta) \propto p(x_n|\theta) q_{-n}(\theta)$
4: moment matching: $f_n(\theta) \leftarrow \mathrm{proj}[\tilde{p}_n(\theta)] / q_{-n}(\theta)$
5: inclusion: $q(\theta) \leftarrow q_{-n}(\theta) f_n(\theta)$

Algorithm 3 SEP
1: choose a datapoint $x_n \sim \mathcal{D}$
2: compute cavity distribution: $q_{-1}(\theta) \propto q(\theta) / f(\theta)$
3: compute tilted distribution: $\tilde{p}_n(\theta) \propto p(x_n|\theta) q_{-1}(\theta)$
4: moment matching: $f_n(\theta) \leftarrow \mathrm{proj}[\tilde{p}_n(\theta)] / q_{-1}(\theta)$
5: inclusion: $q(\theta) \leftarrow q_{-1}(\theta) f_n(\theta)$
6: implicit update: $f(\theta) \leftarrow f(\theta)^{1 - 1/N} f_n(\theta)^{1/N}$

Figure 1: Comparing the Expectation Propagation (EP), Assumed Density Filtering (ADF), and
Stochastic Expectation Propagation (SEP) update steps. Typically, the algorithms will be initialised
using $q(\theta) = p_0(\theta)$ and, where appropriate, $f_n(\theta) = 1$ or $f(\theta) = 1$.
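The EP update steps can be sketched in a few lines for a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes a scalar parameter $\theta$, scalar inputs, labels in {-1, +1}, a probit likelihood and Gaussian factors stored as natural parameters (mean/variance, 1/variance). The helper names (`tilted_moments`, `make_data`, `ep_probit`) are hypothetical. Note the two lists of length N, which are the per-datapoint factors EP has to keep.

```python
import math
import random


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def tilted_moments(m, v, x, y):
    """Mean and variance of the tilted density N(theta; m, v) * Phi(y * x * theta)."""
    s = math.sqrt(1.0 + x * x * v)
    z = y * x * m / s
    r = norm_pdf(z) / norm_cdf(z)
    mean = m + y * x * v * r / s
    var = v - (x * v) ** 2 * r * (z + r) / (s * s)
    return mean, var


def make_data(n, theta_true=1.0, seed=0):
    """Hypothetical synthetic probit data (assumption, for illustration only)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < norm_cdf(theta_true * x) else -1 for x in xs]
    return xs, ys


def ep_probit(xs, ys, prior_var=1.0, sweeps=10):
    n = len(xs)
    # One natural-parameter pair per datapoint: the O(N) storage of EP.
    f_eta, f_lam = [0.0] * n, [0.0] * n      # f_n = 1 initially
    q_eta, q_lam = 0.0, 1.0 / prior_var      # q = p_0 initially
    for _ in range(sweeps):
        for i in range(n):
            # Cavity: divide q by f_n, i.e. subtract natural parameters.
            c_eta, c_lam = q_eta - f_eta[i], q_lam - f_lam[i]
            # Tilted distribution and moment matching (projection).
            mean, var = tilted_moments(c_eta / c_lam, 1.0 / c_lam, xs[i], ys[i])
            # Inclusion: q becomes the projected tilted distribution.
            q_eta, q_lam = mean / var, 1.0 / var
            # Refined factor f_n = proj[tilted] / cavity.
            f_eta[i], f_lam[i] = q_eta - c_eta, q_lam - c_lam
    return q_eta / q_lam, 1.0 / q_lam, len(f_lam)


xs, ys = make_data(200)
post_mean, post_var, n_factors = ep_probit(xs, ys)
print(post_mean, post_var, n_factors)
```

Working in natural parameters turns the division and multiplication of Gaussian factors into subtraction and addition, which is why each step is a one-liner here.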

There is, however, one wrinkle that complicates deployment of EP at scale. Computation of the
cavity distribution requires removal of the current approximating factor, which means any
implementation of EP must store them explicitly, necessitating an O(N) memory footprint. One option
is to simply ignore the removal step, replacing the cavity distribution with the full approximation,
resulting in the ADF algorithm (Algorithm 2) that needs only maintain a global approximation in
memory. But as the moment matching step now over-counts the underlying approximating factor
(consider the new form of the objective $\mathrm{KL}[q(\theta) p(x_n|\theta) \| q(\theta)]$) the variance of the
approximation shrinks to zero as multiple passes are made through the dataset. Early stopping is therefore
required to prevent overfitting and generally speaking ADF does not return uncertainties that are
well-calibrated to the posterior. In the next section we introduce a new algorithm that sidesteps EP’s
large memory demands whilst avoiding the pathological behaviour of ADF.
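The variance collapse of ADF under repeated passes is easy to reproduce on a toy problem. The sketch below makes the same simplifying assumptions as a one-dimensional probit model with labels in {-1, +1}; the function names and the synthetic data are hypothetical and chosen only for illustration. Because the cavity is the full approximation, every visit to a datapoint multiplies in its likelihood again and the variance can only shrink.

```python
import math
import random


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def tilted_moments(m, v, x, y):
    """Mean and variance of the tilted density N(theta; m, v) * Phi(y * x * theta)."""
    s = math.sqrt(1.0 + x * x * v)
    z = y * x * m / s
    r = norm_pdf(z) / norm_cdf(z)
    return m + y * x * v * r / s, v - (x * v) ** 2 * r * (z + r) / (s * s)


def make_data(n, theta_true=1.0, seed=0):
    """Hypothetical synthetic probit data (assumption, for illustration only)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < norm_cdf(theta_true * x) else -1 for x in xs]
    return xs, ys


def adf_probit(xs, ys, prior_var=1.0, sweeps=1):
    m, v = 0.0, prior_var                    # q = p_0 initially; only q is stored
    for _ in range(sweeps):
        for x, y in zip(xs, ys):
            # ADF uses q itself as the cavity, then moment matches.
            m, v = tilted_moments(m, v, x, y)
    return m, v


xs, ys = make_data(100)
for sweeps in (1, 5, 20):
    mean, var = adf_probit(xs, ys, sweeps=sweeps)
    print(sweeps, mean, var)                 # the variance keeps shrinking with more sweeps
```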


3 Stochastic Expectation Propagation 

In this section we introduce a new algorithm, inspired by EP, called Stochastic Expectation
Propagation (SEP) that combines the benefits of local approximation (tractability of updates, distributability,
and parallelisability) with global approximation (reduced memory demands). The algorithm can
be interpreted as a version of EP in which the approximating factors are tied, or alternatively as a
corrected version of ADF that prevents overfitting. The key idea is that, at convergence, the
approximating factors in EP can be interpreted as parameterising a global factor, $f(\theta)$, that captures the
average effect of a likelihood on the posterior: $f(\theta)^N = \prod_{n=1}^N f_n(\theta) \approx \prod_{n=1}^N p(x_n|\theta)$. In this
spirit, the new algorithm employs direct iterative refinement of a global approximation comprising
the prior and N copies of a single approximating factor, $f(\theta)$, that is $q(\theta) \propto f(\theta)^N p_0(\theta)$.

SEP uses updates that are analogous to EP’s in order to refine $f(\theta)$ in such a way that it captures
the average effect a likelihood function has on the posterior. First the cavity distribution is formed
by removing one of the copies of the factor, $q_{-1}(\theta) \propto q(\theta) / f(\theta)$. Second, the corresponding
likelihood is included to produce the tilted distribution $\tilde{p}_n(\theta) \propto q_{-1}(\theta) p(x_n|\theta)$ and, third, SEP
finds an intermediate factor approximation by moment matching, $f_n(\theta) \leftarrow \mathrm{proj}[\tilde{p}_n(\theta)] / q_{-1}(\theta)$.
Finally, having updated the factor, it is included into the approximating distribution. It is important
here not to make a full update since $f_n(\theta)$ captures the effect of just a single likelihood function
$p(x_n|\theta)$. Instead, damping should be employed to make a partial update $f(\theta) \leftarrow f(\theta)^{1-\epsilon} f_n(\theta)^{\epsilon}$.
A natural choice uses $\epsilon = 1/N$, which can be interpreted as minimising $\mathrm{KL}[\tilde{p}_n(\theta) \| p_0(\theta) f(\theta)^N]$
in the moment update, but other choices of $\epsilon$ may be more appropriate, including decreasing $\epsilon$
according to the Robbins-Monro condition [22].

SEP is summarised in Algorithm 3. Unlike ADF, the cavity is formed by dividing out $f(\theta)$, which
captures the average effect of the likelihood and prevents the posterior from collapsing. Like ADF,
however, SEP only maintains the global approximation $q(\theta)$ since $f(\theta) \propto (q(\theta) / p_0(\theta))^{1/N}$ and
$q_{-1}(\theta) \propto q(\theta)^{1 - 1/N} p_0(\theta)^{1/N}$. When Gaussian approximating factors are used, for example, SEP
reduces the storage requirement of EP from $O(ND^2)$ to $O(D^2)$, which is a substantial saving that
enables models with many parameters to be applied to large datasets.
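A minimal sketch of the SEP update on a toy one-dimensional probit model shows how the tied factor is recovered from $q(\theta)$ on the fly instead of being stored per datapoint. This is an illustration under stated assumptions (scalar $\theta$, labels in {-1, +1}, Gaussian factors in natural parameters, $\epsilon = 1/N$), and all function names and the synthetic data are hypothetical.

```python
import math
import random


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def tilted_moments(m, v, x, y):
    """Mean and variance of the tilted density N(theta; m, v) * Phi(y * x * theta)."""
    s = math.sqrt(1.0 + x * x * v)
    z = y * x * m / s
    r = norm_pdf(z) / norm_cdf(z)
    return m + y * x * v * r / s, v - (x * v) ** 2 * r * (z + r) / (s * s)


def make_data(n, theta_true=1.0, seed=0):
    """Hypothetical synthetic probit data (assumption, for illustration only)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < norm_cdf(theta_true * x) else -1 for x in xs]
    return xs, ys


def sep_probit(xs, ys, prior_var=1.0, sweeps=10, seed=0):
    n = len(xs)
    p_eta, p_lam = 0.0, 1.0 / prior_var      # prior p_0 in natural parameters
    q_eta, q_lam = p_eta, p_lam              # q = p_0 * f^N with f = 1 initially
    rng = random.Random(seed)
    for _ in range(sweeps * n):
        i = rng.randrange(n)
        # f = (q / p_0)^(1/N): recovered from q, never stored per datapoint.
        f_eta, f_lam = (q_eta - p_eta) / n, (q_lam - p_lam) / n
        # Cavity q_{-1} = q / f removes a single copy of the tied factor.
        c_eta, c_lam = q_eta - f_eta, q_lam - f_lam
        mean, var = tilted_moments(c_eta / c_lam, 1.0 / c_lam, xs[i], ys[i])
        # Inclusion q <- q_{-1} * f_n, which equals p_0 * (f^(1-1/N) * f_n^(1/N))^N.
        q_eta, q_lam = mean / var, 1.0 / var
    return q_eta / q_lam, 1.0 / q_lam


xs, ys = make_data(200)
print(sep_probit(xs, ys))
```

Only the two numbers describing $q(\theta)$ (and the prior) persist between updates, which is the one-dimensional analogue of the $O(D^2)$ storage quoted above.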


4 Algorithmic extensions to SEP and theoretical results 

SEP has been motivated from a practical perspective by the limitations inherent in EP and ADF. In 
this section we extend SEP in four orthogonal directions and relate SEP to SVI. Many of the algorithms
described here are summarised in Figure 2 and they are detailed in the supplementary material.

4.1 Parallel SEP: relating the EP fixed points to SEP 

The SEP algorithm outlined above approximates one likelihood at a time, which can be
computationally slow. However, it is simple to parallelise the SEP updates by following the same recipe by
which EP is parallelised. Consider a minibatch comprising M datapoints (for a full parallel batch
update use M = N). First we form the cavity distribution for each likelihood. Unlike EP these are
all identical. Next, in parallel, compute M intermediate factors $f_m(\theta) \leftarrow \mathrm{proj}[\tilde{p}_m(\theta)] / q_{-1}(\theta)$.
In EP these intermediate factors become the new likelihood approximations and the
approximation is updated to $q(\theta) = p_0(\theta) \prod_{n \neq m} f_n(\theta) \prod_m f_m(\theta)$. In SEP, the same update is used for
the approximating distribution, which becomes $q(\theta) \leftarrow p_0(\theta) f_{old}(\theta)^{N-M} \prod_m f_m(\theta)$ and, by
implication, the approximating factor is $f_{new}(\theta) = f_{old}(\theta)^{1 - M/N} \prod_{m=1}^M f_m(\theta)^{1/N}$. One way of
understanding parallel SEP is as a double loop algorithm. The inner loop produces intermediate
approximations $q_m(\theta) \leftarrow \arg\min_q \mathrm{KL}[\tilde{p}_m(\theta) \| q(\theta)]$; these are then combined in the outer loop:
$q(\theta) \leftarrow \arg\min_q \sum_{m=1}^M \mathrm{KL}[q(\theta) \| q_m(\theta)] + (N - M) \mathrm{KL}[q(\theta) \| q_{old}(\theta)]$.

For M = 1 parallel SEP reduces to the original SEP algorithm. For M = N parallel SEP is
equivalent to the so-called Averaged EP (AEP) algorithm proposed in [23] as a theoretical tool to study
the convergence properties of normal EP. This work showed that, under fairly restrictive conditions
(likelihood functions that are log-concave and varying slowly as a function of the parameters), AEP
converges to the same fixed points as EP in the large data limit ($N \to \infty$).
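The minibatch update can be written compactly in natural parameters. The sketch below is one possible rendering for a toy one-dimensional probit model, assuming Gaussian factors, labels in {-1, +1} and a hypothetical function name `parallel_sep_step`; it implements $q(\theta) \leftarrow p_0(\theta) f_{old}(\theta)^{N-M} \prod_m f_m(\theta)$ and reduces to the single-datapoint SEP update when M = 1.

```python
import math


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def tilted_moments(m, v, x, y):
    """Mean and variance of the tilted density N(theta; m, v) * Phi(y * x * theta)."""
    s = math.sqrt(1.0 + x * x * v)
    z = y * x * m / s
    r = norm_pdf(z) / norm_cdf(z)
    return m + y * x * v * r / s, v - (x * v) ** 2 * r * (z + r) / (s * s)


def parallel_sep_step(q, prior, batch, n):
    """One minibatch SEP update.

    q and prior are (eta, lam) natural-parameter pairs, batch is a list of
    (x, y) pairs with y in {-1, +1}, and n is the dataset size N.
    """
    m_size = len(batch)
    # Tied factor f_old = (q / p_0)^(1/N) and the shared cavity q_{-1} = q / f_old.
    f_eta, f_lam = (q[0] - prior[0]) / n, (q[1] - prior[1]) / n
    c_eta, c_lam = q[0] - f_eta, q[1] - f_lam
    sum_eta = sum_lam = 0.0
    for x, y in batch:                       # independent, so parallelisable
        mean, var = tilted_moments(c_eta / c_lam, 1.0 / c_lam, x, y)
        sum_eta += mean / var - c_eta        # natural parameters of f_m
        sum_lam += 1.0 / var - c_lam
    # q <- p_0 * f_old^(N - M) * prod_m f_m
    return (q[0] - m_size * f_eta + sum_eta, q[1] - m_size * f_lam + sum_lam)


prior = (0.0, 1.0)
q = parallel_sep_step(prior, prior, [(0.5, 1), (-1.0, -1), (2.0, 1)], n=10)
print(q[0] / q[1], 1.0 / q[1])               # mean and variance after one minibatch
```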

There is another illuminating connection between SEP and AEP. Since SEP’s approximating factor
$f(\theta)$ converges to the geometric average of the intermediate factors, $f(\theta) \propto [\prod_{n=1}^N f_n(\theta)]^{1/N}$, SEP
converges to the same fixed points as AEP if the learning rates satisfy the Robbins-Monro condition
[22], and therefore, under certain conditions [23], to the same fixed points as EP. But it is still an
open question whether there are more direct relationships between EP and SEP.

4.2 Stochastic power EP: relationships to variational methods 

The relationship between variational inference and stochastic variational inference [3] mirrors the
relationship between EP and SEP. Can these relationships be made more formal? If the moment
projection step in EP is replaced by a natural parameter matching step then the resulting algorithm
is equivalent to the Variational Message Passing (VMP) algorithm [24] (and see supplementary
material). Moreover, VMP has the same fixed points as variational inference ed (since minimising 
the local variational KL divergences is equivalent to minimising the global variational KL). 

These results carry over to the new algorithms with minor modifications. Specifically VMP can be
transformed into SVMP by replacing VMP’s local approximations with the global form employed
by SEP. In the supplementary material we show that this algorithm is an instance of standard SVI
and that it therefore has the same fixed points as VI when $\epsilon$ satisfies the Robbins-Monro condition
sa. More generally, the procedure can be applied to any member of the power EP (PEP) [12] family
of algorithms, which replace the moment projection step in EP with alpha-divergence minimization
GD, but care has to be taken when taking the limiting cases (see supplementary). These results lend
weight to the view that SEP is a natural stochastic generalisation of EP.

[Figure 2 diagram. Panel A) Relationships between algorithms; panel B) Relationships between fixed
points. Nodes: VMP, par-VMP, SVMP, AVMP, EP, par-EP, PEP, SEP, par-SEP, AEP. Edge labels: multiple
approximating factors (K = 1); parallel minibatch updates (M = 1, M = N); alpha-divergence updates
($\alpha = -1$). Fixed-point legend: same; same (stochastic methods); same in large data limit
(conditions apply).]

AEP: Averaged EP; AVMP: Averaged VMP; EP: Expectation Propagation; par-EP: EP with parallel updates;
par-SEP: SEP with parallel updates; par-VMP: VMP with parallel updates; PEP: Power EP; SEP: Stochastic
EP; SVMP: Stochastic VMP; VI: Variational Inference; VMP: Variational Message Passing.

Figure 2: Relationships between algorithms. Note that care needs to be taken when interpreting the
alpha-divergence as $\alpha \to -1$ (see supplementary material).

4.3 Distributed SEP: controlling granularity of the approximation 

EP uses a fine-grained approximation comprising a single factor for each likelihood. SEP, on
the other hand, uses a coarse-grained approximation comprising a single global factor to
approximate the average effect of all likelihood terms. One might worry that SEP’s approximation is
too severe if the dataset contains sets of datapoints that have very different likelihood
contributions (e.g. for odd-vs-even handwritten digits classification consider the effect of a 5 and a 9 on the
posterior). It might be more sensible in such cases to partition the dataset into K disjoint pieces
$\{\mathcal{D}_k\}_{k=1}^K$, where $\mathcal{D}_k$ contains $N_k$ datapoints and $N = \sum_{k=1}^K N_k$, and use an approximating factor for each partition.
If normal EP updates are performed on the subsets, i.e. treating $p(\mathcal{D}_k|\theta)$ as a single true factor to be
approximated, we arrive at the Distributed EP algorithm [5, 6]. But such updates are challenging as
multiple likelihood terms must be included during each update, necessitating additional
approximations (e.g. MCMC). A simpler alternative uses SEP/AEP inside each partition, implying a posterior
approximation of the form $q(\theta) \propto p_0(\theta) \prod_{k=1}^K f_k(\theta)^{N_k}$ with $f_k(\theta)^{N_k}$ approximating $p(\mathcal{D}_k|\theta)$.
The limiting cases of this algorithm, when K = 1 and K = N, recover SEP and EP respectively.
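The partitioned approximation can be sketched by keeping one tied factor per partition. The code below is an illustrative toy version (one-dimensional probit model, labels in {-1, +1}, Gaussian factors in natural parameters, damping $\epsilon = 1/N_k$ inside partition k); the function name `dsep_probit`, the `part` assignment list and the synthetic data are assumptions made for this example, and every partition index from 0 to K-1 is assumed to be non-empty.

```python
import math
import random


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def tilted_moments(m, v, x, y):
    """Mean and variance of the tilted density N(theta; m, v) * Phi(y * x * theta)."""
    s = math.sqrt(1.0 + x * x * v)
    z = y * x * m / s
    r = norm_pdf(z) / norm_cdf(z)
    return m + y * x * v * r / s, v - (x * v) ** 2 * r * (z + r) / (s * s)


def make_data(n, theta_true=1.0, seed=0):
    """Hypothetical synthetic probit data (assumption, for illustration only)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < norm_cdf(theta_true * x) else -1 for x in xs]
    return xs, ys


def dsep_probit(xs, ys, part, prior_var=1.0, sweeps=10, seed=0):
    """part[i] is the partition index k (0..K-1) that holds datapoint i."""
    n, k = len(xs), max(part) + 1
    sizes = [part.count(j) for j in range(k)]         # N_k
    f_eta, f_lam = [0.0] * k, [0.0] * k               # one tied factor per partition
    q_eta, q_lam = 0.0, 1.0 / prior_var               # q = p_0 * prod_k f_k^{N_k}
    rng = random.Random(seed)
    for _ in range(sweeps * n):
        i = rng.randrange(n)
        j = part[i]
        # Cavity: remove one copy of the factor of the datapoint's partition.
        c_eta, c_lam = q_eta - f_eta[j], q_lam - f_lam[j]
        mean, var = tilted_moments(c_eta / c_lam, 1.0 / c_lam, xs[i], ys[i])
        fn_eta, fn_lam = mean / var - c_eta, 1.0 / var - c_lam
        q_eta, q_lam = mean / var, 1.0 / var          # inclusion
        # Damped update f_k <- f_k^(1 - 1/N_k) * f_n^(1/N_k).
        f_eta[j] += (fn_eta - f_eta[j]) / sizes[j]
        f_lam[j] += (fn_lam - f_lam[j]) / sizes[j]
    return q_eta / q_lam, 1.0 / q_lam, k


xs, ys = make_data(100)
print(dsep_probit(xs, ys, [0] * 100))                 # K = 1: a single tied factor (SEP)
print(dsep_probit(xs, ys, [i % 4 for i in range(100)]))
print(dsep_probit(xs, ys, list(range(100))))          # K = N: one factor per datapoint (EP)
```

With K = N every partition has size one, so the damped update replaces the factor outright, which is the EP update; with K = 1 the update is the SEP update with $\epsilon = 1/N$.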

4.4 SEP with latent variables 

Many applications of EP involve latent variable models. Although this is not the main focus of the
paper, we show that SEP is applicable in this case without scaling the memory footprint with N.
Consider a model containing hidden variables, $h_n$, associated with each observation, $p(x_n, h_n|\theta)$,
that are drawn i.i.d. from a prior $p_0(h_n)$. The goal is to approximate the true posterior over
parameters and hidden variables $p(\theta, \{h_n\}|\mathcal{D}) \propto p_0(\theta) \prod_n p_0(h_n) p(x_n|h_n, \theta)$. Typically, EP would
approximate the effect of each intractable term as $p(x_n|h_n, \theta) p_0(h_n) \approx f_n(\theta) g_n(h_n)$. Instead,
SEP ties the approximate parameter factors, $p(x_n|h_n, \theta) p_0(h_n) \approx f(\theta) g_n(h_n)$, yielding:

$$q(\theta, \{h_n\}) \propto p_0(\theta) f(\theta)^N \prod_{n=1}^N g_n(h_n). \qquad (2)$$

Critically, as proved in the supplementary material, the local factors $g_n(h_n)$ do not need to be maintained in
memory. This means that all of the advantages of SEP carry over to more complex models involving
latent variables, although this can potentially increase computation time in cases where updates for
$g_n(h_n)$ are not analytic, since then they will be initialised from scratch at each update.
5 Experiments

The purpose of the experiments was to evaluate SEP on a number of datasets (synthetic and real- 
world, small and large) and on a number of models (probit regression, mixture of Gaussians and 
Bayesian neural networks). 


5.1 Bayesian probit regression 


The first experiments considered a simple Bayesian classification problem and investigated the
stability and quality of SEP in relation to EP and ADF as well as the effect of using mini-batches
and varying the granularity of the approximation. The model comprised a probit likelihood
function $P(y_n = 1|\theta) = \Phi(\theta^T x_n)$ and a Gaussian prior over the hyper-plane parameter
$p(\theta) = \mathcal{N}(\theta; 0, \gamma I)$. The synthetic data comprised N = 5,000 datapoints $\{(x_n, y_n)\}$, where $x_n$
were D = 4 dimensional and were either sampled from a single Gaussian distribution (Fig. 3(a)) or
from a mixture of Gaussians (MoGs) with J = 5 components (Fig. 3(b)) to investigate the sensitivity
of the methods to the homogeneity of the dataset. The labels were produced by sampling from
the generative model. We followed [6], measuring the performance by computing an approximation
of $\mathrm{KL}[p(\theta|\mathcal{D}) \| q(\theta)]$, where $p(\theta|\mathcal{D})$ was replaced by a Gaussian that had the same mean and
covariance as samples drawn from the posterior using the No-U-Turn sampler (NUTS) [25], to quantify
the calibration of uncertainty estimations.
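For reference, once the posterior is replaced by a moment-matched Gaussian, the KL divergence between it and a Gaussian approximation has a standard closed form. The symbols $\mu_p, \Sigma_p$ (mean and covariance fitted to the posterior samples) and $\mu_q, \Sigma_q$ (those of the approximation) are introduced here for illustration and do not appear in the original text; D is the dimensionality of $\theta$.

```latex
\mathrm{KL}\big[\mathcal{N}(\mu_p, \Sigma_p) \,\|\, \mathcal{N}(\mu_q, \Sigma_q)\big]
  = \frac{1}{2} \left[ \operatorname{tr}\!\big(\Sigma_q^{-1} \Sigma_p\big)
  + (\mu_q - \mu_p)^\top \Sigma_q^{-1} (\mu_q - \mu_p)
  - D + \ln \frac{\det \Sigma_q}{\det \Sigma_p} \right]
```

The trace and log-determinant terms penalise a mismatch in covariance, so an approximation whose variance collapses, as with ADF, scores badly on this measure even if its mean is accurate.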


Results in Fig. 3(a) indicate that EP is the best performing method and that ADF collapses towards a
delta function. SEP converges to a solution which appears to be of similar quality to that obtained by
EP for the dataset containing Gaussian inputs, but slightly worse when the MoG inputs were used. Variants
of SEP that used larger mini-batches fluctuated less, but typically took longer to converge (although
for the small mini-batches shown this effect is not clear). The utility of finer-grained approximations
depended on the homogeneity of the data. For the second dataset containing MoG inputs (shown in
Fig. 3(b)), finer-grained approximations were found to be advantageous if the datapoints from each
mixture component were assigned to the same approximating factor. Generally it was found that there
is no advantage to retaining more approximating factors than there were clusters in the dataset.

To verify whether these conclusions about the granularity of the approximation hold in real datasets, 
we sampled N = 1,000 datapoints for each of the digits in MNIST and performed odd-vs-even 
classification. Each digit class was assigned its own global approximating factor, K = 10. We 
compare the log-likelihood of a test set using ADF, SEP (K = 1), full EP and DSEP (K = 10) 
in Figure 3(c). EP and DSEP significantly outperform ADF. DSEP is slightly worse than full EP
initially; however, it reduces the memory to 0.1% of full EP (K = 10 factors rather than N = 10,000) without losing accuracy substantially.
SEP’s accuracy was still increasing at the end of learning and was slightly better than ADF. Further 
empirical comparisons are reported in the supplementary, and in summary the three EP methods are 
indistinguishable when likelihood functions have similar contributions to the posterior. 


Finally, we tested SEP’s performance on six small binary classification datasets from the UCI
machine learning repository.¹ We did not consider the effect of mini-batches or the granularity of the
approximation, using K = M = 1. We ran the tests with damping and stopped learning after
convergence (by monitoring the updates of approximating factors). The classification results are
summarised in Table 1. ADF performs reasonably well on the mean classification error metric,
presumably because it tends to learn a good approximation to the posterior mode. However, the
posterior variance is poorly approximated and therefore ADF returns poor test log-likelihood scores. EP
achieves significantly higher test log-likelihood than ADF, indicating that a superior approximation
to the posterior variance is attained. Crucially, SEP performs very similarly to EP, implying that SEP
is an accurate alternative to EP even though it is refining a cheaper global posterior approximation.


5.2 Mixture of Gaussians for clustering 

The small scale experiments on probit regression indicate that SEP performs well for fully-observed 
probabilistic models. Although it is not the main focus of the paper, we sought to test the flexibility 
of the method by applying it to a latent variable model, specifically a mixture of Gaussians. A
synthetic MoGs dataset containing N = 200 datapoints was constructed comprising J = 4 Gaussians.

¹ https://archive.ics.uci.edu/ml/index.html

Figure 3: Bayesian logistic regression experiments (horizontal axis: num. iterations). Panels (a) and
(b) show synthetic data experiments. Panel (c) shows the results on MNIST (see text for full details).


Table 1: Average test results for all methods on probit regression. All methods appear to capture the
posterior’s mode, however EP outperforms ADF in terms of test log-likelihood on almost all of the
datasets, with SEP performing similarly to EP.

             mean error                                     test log-likelihood
Dataset      ADF            SEP            EP             ADF            SEP            EP
Australian   0.328±0.0127   0.325±0.0135   0.330±0.0133   -0.634±0.010   -0.631±0.009   -0.631±0.009
Breast       0.037±0.0045   0.034±0.0034   0.034±0.0039   -0.100±0.015   -0.094±0.011   -0.093±0.011
Crabs        0.056±0.0133   0.033±0.0099   0.036±0.0113   -0.242±0.012   -0.125±0.013   -0.110±0.013
Ionos        0.126±0.0166   0.130±0.0147   0.131±0.0149   -0.373±0.047   -0.336±0.029   -0.324±0.028
Pima         0.242±0.0093   0.244±0.0098   0.241±0.0093   -0.516±0.013   -0.514±0.012   -0.513±0.012
Sonar        0.198±0.0208   0.198±0.0217   0.198±0.0243   -0.461±0.053   -0.418±0.021   -0.415±0.021


The means were sampled from a Gaussian distribution, $p(\mu_j) = \mathcal{N}(\mu; m, I)$, the cluster identity
variables were sampled from a uniform categorical distribution $p(h_n = j) = 1/4$, and each mixture
component was isotropic $p(x_n|h_n) = \mathcal{N}(x_n; \mu_{h_n}, 0.5^2 I)$. EP, ADF and SEP were performed to
approximate the joint posterior over the cluster means $\{\mu_j\}$ and cluster identity variables $\{h_n\}$ (the
other parameters were assumed known).


Figure 4(a) visualises the approximate posteriors after 200 iterations. All methods return good 
estimates for the means, but ADF collapses towards a point estimate as expected. SEP, in contrast, 
captures the uncertainty and returns nearly identical approximations to EP. The accuracy of the 
methods is quantified in Fig. 4(b) by comparing the approximate posteriors to those obtained from 
NUTS. In this case the approximate KL-divergence measure is analytically intractable, so instead we
used the averaged F-norm of the difference between the Gaussian parameters fitted by NUTS and by the
EP methods. These measures confirm that SEP approximates EP well in this case.


5.3 Probabilistic backpropagation 

The final set of tests considers more complicated models and large datasets. Specifically we
evaluate the methods for probabilistic backpropagation (PBP) [4], a recent state-of-the-art method for
scalable Bayesian learning in neural network models. Previous implementations of PBP perform
several iterations of ADF over the training data. The moment matching operations required by ADF
are themselves intractable and they are approximated by first propagating the uncertainty on the
synaptic weights forward through the network in a sequential way, and then computing the gradient
of the marginal likelihood by backpropagation. ADF is used to reduce the large memory cost that
would be required by EP when the amount of available data is very large.

We performed several experiments to assess the accuracy of different implementations of PBP based
on ADF, SEP and EP on regression datasets following the same experimental protocol as in [4] (see
supplementary material). We considered neural networks with 50 hidden units (except for Year and
Protein, for which we used 100). Table 2 shows the average test RMSE and test log-likelihood for each
method. Interestingly, SEP can outperform EP in this setting (possibly because the stochasticity
enabled it to find better solutions), and typically it performed similarly. Memory reductions using
SEP instead of EP were large, e.g. 694Mb for the Protein dataset and 65,107Mb for the Year dataset
(see supplementary). Surprisingly, ADF often outperformed EP, although the results presented for
ADF use a near-optimal number of sweeps and further iterations generally degraded performance.
ADF’s good performance is most likely due to an interaction with the additional moment approximation
required in PBP that is more accurate as the number of factors increases.

Figure 4: Posterior approximation for the mean of the Gaussian components. (a) visualises posterior
approximations over the cluster means (98% confidence level). The coloured dots indicate the true
label (top-left) or the inferred cluster assignments (the rest). In (b) we show the error (in F-norm) of
the approximate Gaussians’ means (top) and covariances (bottom).

Table 2: Average test results for all methods. Datasets are also from the UCI machine learning
repository.

           RMSE                                           test log-likelihood
Dataset    ADF            SEP            EP             ADF            SEP            EP
Kin8nm     0.098±0.0007   0.088±0.0009   0.089±0.0006   0.896±0.006    1.013±0.011    1.005±0.007
Naval      0.006±0.0000   0.002±0.0000   0.004±0.0000   3.731±0.006    4.590±0.014    4.207±0.011
Power      4.124±0.0345   4.165±0.0336   4.191±0.0349   -2.837±0.009   -2.846±0.008   -2.852±0.008
Protein    4.727±0.0112   4.670±0.0109   4.748±0.0137   -2.973±0.003   -2.961±0.003   -2.979±0.003
Wine       0.635±0.0079   0.650±0.0082   0.637±0.0076   -0.968±0.014   -0.976±0.013   -0.958±0.011
Year       8.879±NA       8.922±NA       8.914±NA       -3.603±NA      -3.924±NA      -3.929±NA

6 Conclusions and future work 

This paper has presented the stochastic expectation propagation method for reducing EP’s large
memory consumption, which is prohibitive for large datasets. We have connected the new algorithm
to a number of existing methods including assumed density filtering, variational message passing,
variational inference, stochastic variational inference and averaged EP. Experiments on Bayesian
logistic regression (both synthetic and real world) and mixture of Gaussians clustering indicated
that the new method had an accuracy that was competitive with EP. Experiments on probabilistic
backpropagation on large real-world regression datasets again showed that SEP performs comparably to
EP with a vastly reduced memory footprint. Future experimental work will focus on developing
data-partitioning methods to leverage finer-grained approximations (DSEP), which showed promising
experimental performance, and also mini-batch updates. There is also a need for further theoretical
understanding of these algorithms, and indeed EP itself. Theoretical work will study the convergence
properties of the new algorithms, for which we only have limited results at present. Systematic
comparisons of EP-like algorithms and variational methods will guide practitioners in choosing the
appropriate scheme for their application.
The supplementary material is divided into the following sections. Section A details the design of
stochastic power EP methods and presents relationships between SEP and SVI. Section B extends the
discussion of distributed algorithms and SEP's applicability to latent variable models. Section C
provides experimental details of the Bayesian neural network experiments and presents further
empirical evaluations of the method.


A Further theoretical connections 

We described the extensions of stochastic expectation propagation (SEP) in the main text, and we 
provide more details in this section. 

A.1 Power EP and alpha-EP

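The relationship between EP and variational inference (VI) can be explained by introducing power
EP (PEP) [1]. As preparation, let us consider the alpha-divergence¹ introduced in [2]:

$$D_\alpha[p(\theta)\|q(\theta)] = \frac{4}{1-\alpha^2}\left(1 - \int p(\theta)^{\frac{1+\alpha}{2}} q(\theta)^{\frac{1-\alpha}{2}}\, d\theta\right). \tag{1}$$

The two KL-divergences also belong to the alpha-divergence family, as limiting cases:

$$D_{1}[p(\theta)\|q(\theta)] = \lim_{\alpha \to 1} D_\alpha[p(\theta)\|q(\theta)] = \mathrm{KL}[p(\theta)\|q(\theta)], \tag{2}$$

$$D_{-1}[p(\theta)\|q(\theta)] = \lim_{\alpha \to -1} D_\alpha[p(\theta)\|q(\theta)] = \mathrm{KL}[q(\theta)\|p(\theta)]. \tag{3}$$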
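The two limits can be checked numerically. The following is a minimal sketch, assuming the Amari form of the alpha-divergence with prefactor 4/(1 - α²) and two arbitrary one-dimensional Gaussians chosen purely for illustration; the grid, the Gaussian parameters and the function names are not from the paper.

```python
import numpy as np

def gaussian_pdf(theta, mean, std):
    """Density of N(mean, std^2) evaluated on a grid."""
    return np.exp(-0.5 * ((theta - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def alpha_divergence(p, q, dtheta, alpha):
    """Alpha-divergence on a grid (assumed Amari form, valid for alpha != +-1)."""
    overlap = np.sum(p ** ((1.0 + alpha) / 2.0) * q ** ((1.0 - alpha) / 2.0)) * dtheta
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - overlap)

def kl_divergence(p, q, dtheta):
    """KL[p || q] on the same grid."""
    return np.sum(p * np.log(p / q)) * dtheta

theta = np.linspace(-12.0, 12.0, 200001)
dtheta = theta[1] - theta[0]
p = gaussian_pdf(theta, 0.0, 1.0)   # illustrative "true" density
q = gaussian_pdf(theta, 0.5, 1.5)   # illustrative approximation

d_plus = alpha_divergence(p, q, dtheta, 0.999)    # close to alpha -> 1
d_minus = alpha_divergence(p, q, dtheta, -0.999)  # close to alpha -> -1
kl_pq = kl_divergence(p, q, dtheta)
kl_qp = kl_divergence(q, p, dtheta)
print(d_plus, kl_pq)
print(d_minus, kl_qp)
```

Under these assumptions, the value near α = 1 should sit close to KL[p||q] and the value near α = -1 close to KL[q||p], matching Eqs. (2) and (3).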

Minka [1] also introduced alpha-EP as a generalisation of EP to alpha-divergences, which changes
the moment matching step to an alpha-projection [3] that returns the minimiser of the alpha-divergence
$D_\alpha[\tilde{p}_n(\theta)\|q(\theta)]$ with respect to $q(\theta)$. Examples include the moment projection proj[·], which takes α = 1, and
the information projection, which takes α = -1. However, alpha-projections are difficult to compute
in general, motivating power EP (Algorithm 1), so called because it raises potentials to a power
before applying standard EP updates, as a practical alternative. Minka [1] showed that power
EP with power 1/β, β < ∞, returns a local optimum of the alpha-divergence with α = -1 + 2/β
when converged. However, this still leaves the pathological case α = -1, or β = ∞, since the above
equivalence does not apply. Thus variational message passing (VMP), which takes α → -1, cannot
be interpreted as a special case of power EP. This observation extends to stochastic PEP as well
(Algorithm 2). Instead we derive stochastic VMP in the same spirit in which SEP extends EP: it keeps
the computational steps using the current global estimate but ties all the local factors. We discuss this
extension in detail in the next section and provide its connection to stochastic variational inference.


1 A little math can show that the updates of alpha-EP using different existing alpha-divergence definitions are
equivalent, although the corresponding alpha will change.






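Algorithm 1 PEP

1: choose a factor $f_n$ to refine
2: compute cavity distribution
   $q_{-n}(\theta) \propto q(\theta) / f_n(\theta)^{1/\beta}$
3: compute tilted distribution
   $\tilde{p}_n(\theta) \propto p(x_n|\theta)^{1/\beta} q_{-n}(\theta)$
4: moment matching:
   $f_n(\theta) \leftarrow \left[\mathrm{proj}[\tilde{p}_n(\theta)] / q_{-n}(\theta)\right]^{\beta}$
5: inclusion:
   $q(\theta) \leftarrow q(\theta) f_n(\theta) / f_n^{\mathrm{old}}(\theta)$

Algorithm 2 Stochastic PEP

1: choose a datapoint $x_n \sim \mathcal{D}$
2: compute cavity distribution
   $q_{-1}(\theta) \propto q(\theta) / f(\theta)^{1/\beta}$
3: compute tilted distribution
   $\tilde{p}_n(\theta) \propto p(x_n|\theta)^{1/\beta} q_{-1}(\theta)$
4: moment matching:
   $f_n(\theta) \leftarrow \left[\mathrm{proj}[\tilde{p}_n(\theta)] / q_{-1}(\theta)\right]^{\beta}$
5: inclusion:
   $q(\theta) \leftarrow q(\theta) f_n(\theta) / f(\theta)$
6: implicit update:
   $f(\theta) \leftarrow f(\theta)^{1-\frac{1}{N}} f_n(\theta)^{\frac{1}{N}}$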
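The steps of Algorithm 2 can be sketched in natural-parameter form. The example below is an illustration only, under a strong simplifying assumption: the model is a Gaussian mean with known noise variance, so the tilted distribution is already Gaussian and the projection step is the identity. The function name, the toy data and all parameter values are hypothetical and not taken from the paper.

```python
import numpy as np

def stochastic_pep_gaussian(x, noise_var, prior_mean, prior_var,
                            beta=2.0, n_passes=50, seed=0):
    """Stochastic PEP for the mean of a Gaussian with known noise variance.

    Every distribution is stored as natural parameters
    (precision * mean, precision), so products and powers of factors
    become sums and scalings. Only ONE tied factor f is kept in memory.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    lam_prior = np.array([prior_mean / prior_var, 1.0 / prior_var])
    lam_f = np.zeros(2)                                   # tied factor f(theta)
    for _ in range(n_passes * n):
        x_n = x[rng.integers(n)]                          # 1: choose a datapoint
        lam_q = lam_prior + n * lam_f                     # q = p0 * f^N
        lam_cav = lam_q - lam_f / beta                    # 2: cavity q / f^(1/beta)
        lam_lik = np.array([x_n / noise_var, 1.0 / noise_var])
        lam_tilted = lam_cav + lam_lik / beta             # 3: tilted distribution
        lam_proj = lam_tilted                             # 4: proj is the identity here
        lam_fn = beta * (lam_proj - lam_cav)              #    f_n = (proj / cavity)^beta
        # 5-6: inclusion and implicit update; q is rebuilt from f next iteration
        lam_f = (1.0 - 1.0 / n) * lam_f + lam_fn / n
    lam_q = lam_prior + n * lam_f
    return lam_q[0] / lam_q[1], 1.0 / lam_q[1]            # mean, variance of q

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=200)                        # toy data
mean_sep, var_sep = stochastic_pep_gaussian(x, 1.0, 0.0, 10.0)
exact_var = 1.0 / (1.0 / 10.0 + len(x) / 1.0)
exact_mean = exact_var * np.sum(x) / 1.0
print(mean_sep, exact_mean, var_sep, exact_var)
```

In this conjugate toy case the intermediate factor equals the exact likelihood term for every β, so the variance matches the exact posterior, while the mean fluctuates around the exact value because the tied factor is a running average over randomly drawn datapoints. Non-conjugate models would need a genuine moment-matching step in place of the identity.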


A.2 Connecting SVMP to SVI 

We first briefly sketch the VMP algorithm using the EP framework, but replacing the moment matching
step with natural parameter matching. We assume the approximate posterior $q(\theta)$ is in some
exponential family:

$$q(\theta) \propto \exp\left[\langle \lambda_q, \phi(\theta) \rangle\right]. \tag{4}$$

At time t we have the current estimate of the natural parameter $\lambda_q^t$, which is defined as the sum of
local variational parameters²: $\lambda_q^t = \lambda_0 + \sum_n \lambda_n^t$. Here $\lambda_0$ represents the natural parameter of
the prior distribution $p_0(\theta)$. VMP iteratively computes the update of each local estimate $\lambda_n^{t+1}$ in the
following procedure. First VMP computes the expected sufficient statistics $s_n$ about datapoint $x_n$
using $\lambda_q^t$, e.g. $s_n = \mathbb{E}_q[t(z_n, x_n)]$ in the original SVI paper [4]. Then VMP forms the gradient as
though optimising the maximised evidence lower bound (ELBO) but with $q_{-n}(\theta)$ as the prior:

$$\nabla_{\lambda_q} \mathcal{L} = \lambda_{-n}^t + s_n - \lambda_q, \tag{5}$$

$$\lambda_{-n}^t = \lambda_q^t - \lambda_n^t. \tag{6}$$

Next VMP zeros the gradient and recovers the current update $\lambda_n^{t+1} = s_n$. The stochastic version
of VMP, if extended in the way that SEP is developed from EP, defines the global variational parameters
as $\lambda_q^t = \lambda_0 + N\lambda^t$. It computes the expected sufficient statistics $s_n$ in the same way as VMP but
changes the cavity to $\lambda_{-1}^t = \lambda_q^t - \lambda^t$ in the ELBO maximisation steps. Readers can verify that this
returns the current update $\lambda^{t+1} = s_n$, using the important fact that the local update depends only
on the global parameter $\lambda_q^t$. Now since we tie all the local updates, the global parameter update is
$\lambda_q^{t+1} = \lambda_0 + N\lambda^{t+1} = \lambda_0 + N s_n$. In practice we perform a damped update, where a typical
choice of step size is ε = 1/N, as in SEP:

$$\lambda_q^{t+1} \leftarrow \left(1 - \frac{1}{N}\right)\lambda_q^t + \frac{1}{N}\left(\lambda_0 + N\lambda^{t+1}\right) = \lambda_0 + (N-1)\lambda^t + s_n. \tag{7}$$

On the other hand, [5] summarises stochastic variational inference (SVI) as computing the current
update by zeroing the gradient

$$\nabla_{\lambda_q} \mathcal{L} = \lambda_0 + N s_n - \lambda_q, \tag{8}$$

which returns $\lambda_q^{t+1} = \lambda_0 + N s_n$ as well. This implies that SVI, when using learning rate ε = 1/N,
is equivalent to SVMP.
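The algebra behind the damped update in Eq. (7) can be spelled out as a short worked step. It uses only the tied parameterisation $\lambda_q^t = \lambda_0 + N\lambda^t$ and the tied local update $\lambda^{t+1} = s_n$ stated above; the symbol names follow that notation.

```latex
\begin{align*}
\lambda_q^{t+1}
  &= \Bigl(1 - \tfrac{1}{N}\Bigr)\lambda_q^{t}
     + \tfrac{1}{N}\bigl(\lambda_0 + N\lambda^{t+1}\bigr) \\
  &= \Bigl(1 - \tfrac{1}{N}\Bigr)\bigl(\lambda_0 + N\lambda^{t}\bigr)
     + \tfrac{1}{N}\lambda_0 + s_n \\
  &= \lambda_0 + (N-1)\lambda^{t} + s_n .
\end{align*}
```

The first line is also what an SVI step with learning rate 1/N gives when moving from $\lambda_q^t$ towards the zero-gradient point $\lambda_0 + N s_n$ of Eq. (8), which is the sense in which the two procedures coincide.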

A.3 SEP from a global approximation perspective 

In this section we provide some intuition about SEP via an interpretation as approximate minimisation
of a global divergence, as in VI (although it is computed in a truly local way). This framework
utilises the alpha-divergence, but on the global posterior, and we motivate it by describing VI and SVI as
2 This notation implicitly assumes that the prior and the approximate posterior belong to the approximate
distribution family. In general we can propose another factor to approximate $p_0(\theta)$, and our result still applies.
Figure 1: (a) A geometric view of the AEP/PEP comparison. (b) A cartoon illustration of DEP, SEP and
DSEP. For each algorithm we show the approximate posterior on the top and the tilted distribution
at the bottom.


divergence minimisation. VI performs global optimisation on $\mathrm{KL}[q(\theta)\|p(\theta|\mathcal{D})]$, and its stochastic
version, SVI, can be interpreted as minimising, at each step, $\mathrm{KL}[q(\theta)\|p(\theta|\{x_n\}^N)]$ with the N
replicas $\{x_n\}^N = \{x_n, x_n, ..., x_n\}$. Similarly, we state SEP as a stochastic global optimisation
procedure, which computes an iterative procedure to minimise the alpha-divergence $D_\alpha[p(\theta|\{x_n\}^N)\|q(\theta)]$
with α = -1 + 2/N. Indeed we can understand the inner loop of AEP as PEP with power 1/N if we
consider $f(\theta)^N$ as a large composite factor that approximates the likelihood term of $x_n$ raised to
the power N.

However, minimising the alpha-divergence between the true posterior $p(\theta|\mathcal{D})$ and the global
approximation $q(\theta)$ recovers PEP on the whole dataset instead, and the factor to include in the tilted
distribution changes to the intractable geometric average $\mathrm{avg}[\{p(x_n|\theta)\}] = \prod_{n=1}^{N} p(x_n|\theta)^{1/N}$.
Readers might have noticed that the update of PEP on the full dataset is given by
$q(\theta) \leftarrow \mathrm{proj}[\mathrm{avg}[\{\tilde{p}_n(\theta)\}]]$. In other words, we can interpret AEP as an approximation to the
impractical batch PEP by interchanging the projection and averaging operations, and we illustrate a
geometric view of this in Fig. 1(a).

It is important to note that SEP/AEP at convergence does not minimise the alpha-divergence globally.
Like PEP, the inner loop computes $\mathrm{proj}[\tilde{p}_n(\theta)]$, where one can show that it moves towards
minimising $D_\alpha[p(\theta|\{x_n\}^N)\|q_n(\theta)]$ using the same techniques as before. However, the outer loop
averages the natural parameters of the intermediate answers, which does not follow the gradient
direction of alpha-divergence minimisation. Furthermore, local and global optimisation of the alpha-divergence
are inconsistent in terms of fixed points except at α = -1, the divergence utilised in VI and
VMP. Indeed we provide the fixed point conditions of AEP, which reveal its local nature.

Proposition 1. The fixed points of averaged EP, if they exist, can be written as $q(\theta) = \mathrm{avg}[\{q_n(\theta)\}]$,
where

$$q_n(\theta) = \mathrm{proj}[\tilde{p}_n(\theta)], \tag{9}$$

$$\tilde{p}_n(\theta) \propto q(\theta) \frac{p(x_n|\theta)}{f(\theta)}. \tag{10}$$

These fixed points are also the fixed points of stochastic EP when the learning rates satisfy the
Robbins-Monro condition [6].

This fixed point condition applies to stochastic PEP as well when α ≠ -1, and importantly it also
implies the pathology of constructing SVMP by using SPEP and taking the limit α → -1.

B Algorithmic design details 

B.1 Distributed computing methods

We have shown in the main paper that a proper design of data partitioning improves SEP's approximation
accuracy. This distributed algorithm is inspired by the Distributed EP (DEP) algorithm [7, 8]
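Algorithm 3 DEP

1: compute cavity distribution
   $q_{-k}(\theta) \propto q(\theta) / f_k(\theta)$
2: compute tilted distribution
   $\tilde{p}_k(\theta) \propto p(\mathcal{D}_k|\theta) q_{-k}(\theta)$
3: moment matching:
   $f_k(\theta) \leftarrow \mathrm{proj}[\tilde{p}_k(\theta)] / q_{-k}(\theta)$

Algorithm 4 DSEP

1: compute cavity distribution
   $q_{-1}(\theta) \propto q(\theta) / f_k(\theta)$
2: choose a datapoint $x_n \sim \mathcal{D}_k$
3: compute tilted distribution
   $\tilde{p}_n(\theta) \propto p(x_n|\theta) q_{-1}(\theta)$
4: moment matching:
   $f_n(\theta) \leftarrow \mathrm{proj}[\tilde{p}_n(\theta)] / q_{-1}(\theta)$
5: inclusion:
   $f_k(\theta) \leftarrow f_k(\theta)^{1 - 1/N_k} f_n(\theta)^{1/N_k}$

Algorithm 5 DAEP

1: compute cavity distribution
   $q_{-1}(\theta) \propto q(\theta) / f_k(\theta)$
2: for each $x_n \in \mathcal{D}_k$:
3:    compute tilted distribution
      $\tilde{p}_n(\theta) \propto p(x_n|\theta) q_{-1}(\theta)$
4:    moment matching:
      $f_n(\theta) \leftarrow \mathrm{proj}[\tilde{p}_n(\theta)] / q_{-1}(\theta)$
5: inclusion:
   $f_k(\theta)^{N_k} \leftarrow \prod_n f_n(\theta)$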


Figure 2: Comparing the variants of distributed design for Expectation Propagation (EP) on the
current data piece $\mathcal{D}_k$. One should notice that the definitions of $f_k(\theta)$ are different for DEP and
DSEP/DAEP. Distributed EP (DEP) uses sampling methods to compute the projection step, while
Distributed SEP/AEP (DSEP/DAEP) still keeps deterministic computations.


presented in Algorithm 3. DEP first partitions the dataset into K disjoint pieces $\{\mathcal{D}_k\}_{k=1}^{K}$, where
$\mathcal{D}_k$ contains $N_k$ datapoints and $N = \sum_{k=1}^{K} N_k$, which is well-justified since the true posterior can also be derived as

$$p(\theta|\mathcal{D}) \propto p_0(\theta) \prod_{k=1}^{K} p(\mathcal{D}_k|\theta), \tag{11}$$

$$p(\mathcal{D}_k|\theta) = \prod_{x_n \in \mathcal{D}_k} p(x_n|\theta). \tag{12}$$

Next DEP assigns factors to each sub-dataset likelihood, i.e. $q(\theta) \propto p_0(\theta) \prod_k f_k(\theta)$ with each
$f_k(\theta)$ approximating $p(\mathcal{D}_k|\theta)$. The projection step is no longer analytically tractable in general
since the tilted distribution with multiple datapoints often lacks a simple form. Instead DEP handles
moment matching with sampling, making it stochastic in the sense of having a stochastic approximation
of the moments.

To construct a deterministic counterpart of DEP, we consider running SEP/AEP inside each partition.
We call this approach Distributed SEP/AEP (DSEP/DAEP) and provide a comparison in
Fig. 1(b) with DEP and SEP on the sub-dataset likelihood factors using the sampling protocol. Unlike
DEP, the approximate posterior for DSEP is defined as $q(\theta) \propto p_0(\theta) \prod_k f_k(\theta)^{N_k}$, with
$f_k(\theta)^{N_k}$ approximating $p(\mathcal{D}_k|\theta)$. The computations are almost the same as SEP/AEP except that
the updates only modify the copies of the corresponding subset. These two algorithms are detailed
in Algorithms 4 and 5, respectively. In Section C.3 we provide an empirical study comparing the
SEP, EP and DSEP approximations.
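The factorisation in Eqs. (11) and (12), and the resulting number of factors each scheme has to store, can be illustrated with a small sketch. It assumes a toy Gaussian likelihood and a random partition into K pieces; the helper name and all values are hypothetical and serve only to make the bookkeeping concrete.

```python
import numpy as np

def gaussian_loglik(x, theta, noise_var=1.0):
    """Sum of log N(x_n; theta, noise_var) over the datapoints in x."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * noise_var)
                  - 0.5 * (x - theta) ** 2 / noise_var)

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=1000)                 # toy dataset D, N = 1000
K = 20
partitions = np.array_split(rng.permutation(x), K)  # K disjoint pieces D_k
sizes = [len(part) for part in partitions]          # N_k, summing to N

theta = 0.7                                         # arbitrary parameter value
full = gaussian_loglik(x, theta)                    # log p(D | theta)
by_partition = sum(gaussian_loglik(part, theta) for part in partitions)

# Number of approximating factors kept in memory by each scheme
factors_stored = {"EP": len(x), "DSEP": K, "SEP": 1}
print(sum(sizes), full, by_partition, factors_stored)
```

The per-partition log-likelihoods add up to the full-data log-likelihood, which is why assigning one (tied) factor per partition leaves the target posterior unchanged.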

B.2 SEP with latent variables 

In this section we show the applicability of SEP to latent variable models without scaling the memory
consumption with N. We consider a model containing latent variables $h_n$ associated with each
observation $x_n$, which are drawn i.i.d. from a prior $p_0(h_n)$. SEP proposes approximations to the
true posterior over parameters and hidden variables

$$p(\theta, \{h_n\}|\mathcal{D}) \propto p_0(\theta) \prod_n p_0(h_n) p(x_n|h_n, \theta) \tag{13}$$

by tying the factors for the global parameter θ but retaining the local factors for the hidden variables:

$$q(\theta, \{h_n\}) \propto p_0(\theta) f(\theta)^N \prod_{n=1}^{N} g_n(h_n). \tag{14}$$

In other words, SEP uses $f(\theta) g_n(h_n)$ to approximate $p(x_n|h_n, \theta) p_0(h_n)$.

Next we show a critical advantage of SEP when approximating the latent variable posterior
distributions: the local factors $g_n(h_n)$ do not need to be maintained in memory (though see the caveats
mentioned below). More formally, the cavity distribution is $q_{-n}(\theta, \{h_n\}) \propto q(\theta, \{h_n\}) / (f(\theta) g_n(h_n))$
Table 1: Datasets used in the experiments with neural networks. The memory numbers reported
include dataset storage and temporary maintenance of computation graphs in Theano (~100 MB).

Dataset                        N          D     MB (EP)     MB (SEP)    MB reduced
Kin8nm                         8192       8     168.23      109.76      58.47
Naval Propulsion               11,934     16    261.75      113.92      147.83
Combined Cycle Power Plant     9568       4     148.70      110.99      37.71
Protein Structure              45,730     9     815.55      121.52      694.02
Wine Quality Red               1599       11    122.21      107.90      14.30
Year Prediction MSD            515,345    90    67837.90    2730.55     65107.34


and the tilted distribution is $\tilde{p}_n(\theta, \{h_n\}) \propto q_{-n}(\theta, \{h_n\}) p(x_n|h_n, \theta) p_0(h_n)$. This leads to a
moment update that minimises

$$\mathrm{KL}\Bigl[\, p_0(\theta) f(\theta)^{N-1} p(x_n|h_n, \theta) p_0(h_n) \prod_{m \neq n} g_m(h_m) \,\Big\|\, p_0(\theta) f(\theta)^{N-1} f'(\theta) g_n(h_n) \prod_{m \neq n} g_m(h_m) \Bigr]$$

with respect to $f'(\theta) g_n(h_n)$. Importantly, the terms involving $\prod_{m \neq n} g_m(h_m)$ cancel, meaning that
these factors do not contribute to the local approximation step. For simple models the moments of
$h_n$ can be computed analytically given $q_{-1}(\theta)$, thus $g_n(h_n)$ is never stored in memory, again
reducing the memory footprint by a factor of N. However, in practice people may prefer
maintaining the g factors in memory if the moment computation requires another optimisation inner
loop (which might be more expensive than the moment matching step itself). Examples include
latent Dirichlet allocation [9], which has a hierarchy of latent variables, where VI methods also store
variational q distributions for some of the hidden variables. One potential recipe in this scenario is
to learn the moments/messages passed in each SEP step in the spirit of [10, 11].

It is also possible to have latent variables that are globally shared or shared within a data subset $\mathcal{D}_k$. We can
extend SEP to these latent variables accordingly, which still provides gains in space
complexity. Formally, assume $h_k$ is a latent variable shared in $\mathcal{D}_k$. Then we construct
$q(h_k) \propto p_0(h_k) g_k(h_k)^{N_k}$ to approximate its posterior. This procedure still reduces memory by a
factor of N/K.


C Further experimental results 

C.1 Details of Bayesian neural network tests

We perform neural network regression experiments with publicly available data sets and neural
networks with one hidden layer. Table 1 lists the analyzed data sets and shows summary statistics.
We use neural networks with 50 hidden units in all cases except in the two largest ones, i.e. Year
Prediction MSD and Protein Structure, where we use 100 hidden units. The different methods,
SEP, EP and ADF, were run by performing 40 passes over the available training data, updating the
parameters of the posterior approximation after seeing each data point. The data sets are split into
random training and test sets with 90% and 10% of the data, respectively. This splitting process is
repeated 20 times and the average test performance of each method is reported. In the two largest
data sets, Year Prediction MSD and Protein Structure, we do the train-test splitting only one and
five times respectively. The data sets are normalized so that the input features and the targets have
zero mean and unit variance in the training set. The normalization on the targets is removed for
prediction.
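The splitting and normalisation protocol described above can be sketched as follows. This is a minimal illustration on synthetic data, not the code used for the experiments; the function name, the guard for constant features and the choice to normalise test inputs with training statistics are assumptions.

```python
import numpy as np

def split_and_normalise(X, y, train_fraction=0.9, seed=0):
    """Random 90/10 split, with normalisation statistics taken from the training set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    n_train = int(round(train_fraction * len(X)))
    train, test = perm[:n_train], perm[n_train:]

    x_mean, x_std = X[train].mean(axis=0), X[train].std(axis=0)
    x_std[x_std == 0] = 1.0                      # assumption: guard constant features
    y_mean, y_std = y[train].mean(), y[train].std()

    X_train = (X[train] - x_mean) / x_std
    X_test = (X[test] - x_mean) / x_std          # assumption: reuse training statistics
    y_train = (y[train] - y_mean) / y_std

    def unnormalise(y_pred):
        """Map predictions back to the original target scale."""
        return y_pred * y_std + y_mean

    return X_train, y_train, X_test, y[test], unnormalise

rng = np.random.default_rng(1)
X = rng.normal(5.0, 3.0, size=(1000, 8))         # synthetic inputs
y = X @ rng.normal(size=8) + rng.normal(size=1000)
X_train, y_train, X_test, y_test, unnormalise = split_and_normalise(X, y)
print(X_train.shape, X_test.shape)
```

Repeating the call with different `seed` values would give the repeated random splits over which test performance is averaged.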

We also provide the memory consumption details for experiments using probabilistic back-propagation
(PBP) in Table 1. We observe substantial memory reductions by running SEP instead
of EP, while still attaining similar accuracies. Especially for the Year Prediction MSD dataset, which is
a typical large-scale dataset both in the number of observations N and the dimensionality D, SEP
saves tens of gigabytes. We performed the test for EP using a machine with more than
100GB RAM, while SEP only required 2.7GB of memory, including the space for storing the dataset
(1.9GB). These numbers reveal the huge memory requirement of full EP and further support SEP as
a practical alternative in big data, big model settings.
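The savings can be recomputed directly from the figures in Table 1. The snippet below is only a worked arithmetic check; the dictionary layout and variable names are illustrative.

```python
# (MB with EP, MB with SEP), copied from Table 1
memory_mb = {
    "Kin8nm": (168.23, 109.76),
    "Naval Propulsion": (261.75, 113.92),
    "Combined Cycle Power Plant": (148.70, 110.99),
    "Protein Structure": (815.55, 121.52),
    "Wine Quality Red": (122.21, 107.90),
    "Year Prediction MSD": (67837.90, 2730.55),
}

saved = {name: ep - sep for name, (ep, sep) in memory_mb.items()}
ratio = {name: ep / sep for name, (ep, sep) in memory_mb.items()}

for name in memory_mb:
    print(f"{name:28s} saved {saved[name]:10.2f} MB   EP/SEP ratio {ratio[name]:6.2f}")
```

The measured ratios are far smaller than N because, as the caption of Table 1 notes, the reported totals also include dataset storage and roughly 100 MB of computation-graph overhead, which SEP does not reduce.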
Figure 3: Performance of EP methods on Bayesian logistic regression with sampling moment
computations, measured in approximate KL divergence described in the main text.


C.2 Stochastic EP with sampling protocol

Although it is not our main purpose, we further test the performance of SEP when using sampling methods
to compute moments³. We re-use the settings of probit regression but change the probit unit to
a sigmoid function, making the moment projection analytically intractable. We randomly partition
the dataset into K = 20 subsets $\{\mathcal{D}_k\}$, construct the approximate posterior with local factors over
the subsets, and tie them in SEP/AEP as before. Note that we perform sequential computations for
DEP and AEP although they are ideally suited for parallel computing. Again, as presented in Figure
3, SEP performs almost as well as EP, which further justifies SEP even with sampling methods. Also,
AEP is indistinguishable from DEP, but it reduces memory by a factor of K.

C.3 Further Comparisons for SEP, DSEP and full EP 

The assumption we made in the main text for SEP to approximate full EP is that the contributions of
each likelihood term to the posterior are similar. We show further results here on the approximations
produced by different EP methods when there is significant heterogeneity in the data. We generated
synthetic XOR classification data by sampling from 4 unit Gaussians with means (3, 3), (-3, -3),
(3, -3) and (-3, 3), and labelling the clusters centred at the former two as negative examples (and
the others as positive). The model $p(y_n|x_n, \theta)$ is kernel probit regression using an RBF kernel with
width l = 1.0, which is the same as the model presented in Section 5.1 in the main text except that
the features are changed to kernel representations. This makes the feature vectors high dimensional,
and the local nature of kernels also makes the kernel-expanded inputs very different if the datapoints
belong to different clusters. We generated 50 * 4 test data and {10 * 4, 20 * 4, 50 * 4} training data
and ran SEP/DSEP/full EP to approximate the posterior distribution of θ. For DSEP we partitioned
the dataset into 4 subsets according to the associated centroid. Each experiment was repeated 10
times to collect the average test data log-likelihood and classification errors.
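The data generation and DSEP partitioning described above can be sketched as follows. This is an illustration under stated assumptions, not the experimental code: labels are encoded as -1/+1, the RBF kernel is taken to be exp(-||x - x'||² / (2 l²)), and the training points are used as kernel centres. All function names are hypothetical.

```python
import numpy as np

def make_xor_data(n_per_cluster, rng):
    """Sample XOR data from 4 unit Gaussians; the first two clusters are negative."""
    means = np.array([[3.0, 3.0], [-3.0, -3.0], [3.0, -3.0], [-3.0, 3.0]])
    X = np.concatenate([m + rng.standard_normal((n_per_cluster, 2)) for m in means])
    y = np.concatenate([-np.ones(2 * n_per_cluster), np.ones(2 * n_per_cluster)])
    cluster = np.repeat(np.arange(4), n_per_cluster)   # generating centroid index
    return X, y, cluster

def rbf_features(X, centres, width=1.0):
    """Kernel representation of X with respect to the given centres (assumed RBF form)."""
    sq_dist = np.sum((X[:, None, :] - centres[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / width ** 2)

rng = np.random.default_rng(0)
X_train, y_train, cluster_train = make_xor_data(10, rng)   # 10 * 4 training points
X_test, y_test, _ = make_xor_data(50, rng)                 # 50 * 4 test points

Phi_train = rbf_features(X_train, X_train)   # one feature per training point
Phi_test = rbf_features(X_test, X_train)

# DSEP partition: one subset per associated centroid
dsep_partitions = [np.flatnonzero(cluster_train == k) for k in range(4)]
print(Phi_train.shape, Phi_test.shape, [len(p) for p in dsep_partitions])
```

With width 1.0 and cluster means six or more units apart, the kernel features of points from different clusters barely overlap, which is the heterogeneity the experiment is designed to probe.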

Table 2 shows the quantitative results and Figure 4 visualises the contours of the predictive
probability $p(y = 1|x, \mathcal{D})$ with the true posterior approximated by $q(\theta)$. Interestingly, SEP is slightly
better than the others on the classification error metric. But importantly, EP achieves the best test
log-likelihood numbers in most cases and in general DSEP produces very similar results (shown by both the table
and the figure), meaning that even for small datasets running full EP might be unnecessary. Also,
the three methods become indistinguishable when the size of the dataset N increases. We argue that the
main reason is that the posterior contributions become more similar as more datapoints are observed
within the circle of kernel width.

We further tested the robustness of all three methods to outliers. We reused the settings above and
randomly flipped 10% of the labels of the training data. Qualitative results in Figure 5 show that SEP is almost
as robust as DSEP/EP in this example. We tried different types of outliers and failed to find
cases where EP/DSEP significantly outperforms SEP. Future work should further characterise
when SEP gives bad approximations and, separately, whether it fails in the same way as EP fails,
e.g. EP can fail to converge.

3 Code adjusted from ep-stan: https://github.com/gelman/ep-stan
Table 2: Average test results of all methods on kernel probit regression.

          mean error                                      test log-likelihood
N         SEP             DSEP            EP              SEP             DSEP            EP
10*4      0.032±0.0058    0.055±0.0127    0.032±0.0097    -0.405±0.011    -0.380±0.010    -0.378±0.009
20*4      0.007±0.0014    0.008±0.0024    0.012±0.0031    -0.326±0.007    -0.320±0.006    -0.317±0.003
50*4      0.003±0.0010    0.003±0.0014    0.006±0.0009    -0.243±0.004    -0.233±0.007    -0.238±0.003
Figure 4: Comparing predictions of kernel probit regression trained by SEP/DSEP/EP, with increasing
training data size N.


[Figure panels (contour plots, axes from -6 to 6, colour scale from 0.0 to 1.0): (a) SEP, N = 40; (b) DSEP, N = 40; (c) EP, N = 40; (f) EP, N = 80; (g) SEP, N = 200; (h) DSEP, N = 200; (i) EP, N = 200.]


Figure 5: Comparing predictions of kernel probit regression trained by SEP/DSEP/EP, with 10% 
labels flipped. 